IN THE CLAIMS: 



Please amend the claims as follows. No new matter is introduced. 

1. (Currently Amended) A program storage device readable by a machine,
tangibly embodying a program of instructions executable by the machine to perform 
method steps for speech synthesis, the method steps comprising: 

providing a text string comprising a plurality of words and phonemes and 
corresponding spoken audio signal; 

extracting acoustic feature data from said audio signal, wherein extracting
acoustic feature data from said audio signal comprises digitizing the spoken audio signal 
into a set of frames and transforming the digitized input waveforms into a set of feature 
vectors on a frame-by-frame basis by producing a multi-dimensional cepstra feature 
vector for a predetermined intervals of the spoken audio signal, concatenating frames to 
the left and to the right of a current frame to augment a current cepstral vector, and 
reducing the dimension of each augmented cepstral vector using linear discriminant 
analysis;

aligning the text string and the acoustic feature data and outputting a set of 
duration contours indicative of the duration of each word and phoneme; 

extracting pitch contour parameters from said audio spoken input; 

automatically generating a marked-up text corresponding to the spoken utterance 
using the pitch and duration contours; and 

generating a synthetic waveform using the marked-up text. 
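
The frame-concatenation and dimension-reduction steps recited in claim 1 can be sketched as follows. This is a minimal illustration only: the names `splice_frames` and `project` are hypothetical, the edge frames are padded by repetition (an assumption), and the linear discriminant analysis transform is assumed to have been estimated separately and is simply passed in as a matrix. Computation of the cepstral vectors themselves is not shown.

```python
from typing import List

Vector = List[float]


def splice_frames(frames: List[Vector], context: int) -> List[Vector]:
    """Augment each frame with `context` frames to its left and right.

    Frames beyond either end of the signal are replaced by the nearest
    edge frame (an assumed padding choice).
    """
    if not frames:
        return []
    last = len(frames) - 1
    spliced: List[Vector] = []
    for t in range(len(frames)):
        augmented: Vector = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp to a valid frame index
            augmented.extend(frames[idx])
        spliced.append(augmented)
    return spliced


def project(vectors: List[Vector], transform: List[Vector]) -> List[Vector]:
    """Reduce dimension by applying a linear transform, one row per output
    dimension. The transform stands in for a pre-trained LDA matrix."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in transform]
        for vec in vectors
    ]
```

With one-dimensional frames and a context of one frame, each augmented vector has three components, and a two-row transform reduces it to two.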

2-3. (Canceled) 

4. (Previously Presented) The program storage device of claim 1, wherein the
instructions for aligning comprise instructions for segmenting said spoken audio signal 
into time-segmented regions, wherein each time-segmented region is mapped to a
corresponding phoneme. 

5. (Previously Presented) The program storage device of claim 1, wherein the
alignment is performed using a Viterbi alignment process. 
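
A Viterbi alignment of the kind named in claim 5 can be sketched as a dynamic program over frames and phonemes. The sketch below is an illustration under stated assumptions, not the claimed implementation: `viterbi_align` is a hypothetical name, each phoneme is modeled as a single state, the per-frame log-scores are assumed to be supplied by some acoustic model, and every phoneme must receive at least one frame.

```python
from typing import List


def viterbi_align(scores: List[List[float]]) -> List[int]:
    """Return the number of frames assigned to each phoneme.

    scores[t][p] is the log-score of frame t under the p-th phoneme of
    the transcript (assumed to come from an acoustic model).
    """
    n_frames = len(scores)
    n_phones = len(scores[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")

    neg_inf = float("-inf")
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the path must start in the first phoneme

    for t in range(1, n_frames):
        for p in range(n_phones):
            stay = best[t - 1][p]
            move = best[t - 1][p - 1] if p > 0 else neg_inf
            if move > stay:
                best[t][p] = move + scores[t][p]
                advanced[t][p] = True  # entered phoneme p at frame t
            else:
                best[t][p] = stay + scores[t][p]

    # Trace back from the last frame, which must lie in the last phoneme.
    durations = [0] * n_phones
    p = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        durations[p] += 1
        if advanced[t][p]:
            p -= 1
    return durations
```

Multiplying each frame count by the frame shift gives a duration per phoneme, from which time-segmented regions follow directly.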

6. (Canceled). 

7. (Previously Presented) The program storage device of claim 1, wherein the 
instructions for automatically generating a marked-up text comprise instruction for 
directly specifying the pitch contour and duration parameters as attribute values for
mark-up elements.

8. (Previously Presented) The program storage device of claim 1, wherein the
instructions for automatically generating a marked-up text comprise instructions for 
assigning abstract labels to the pitch contour and duration parameters to generate a
high-level markup.

9. (Original) The program storage device of claim 1, wherein the marked-up
text is generated using SSML (speech synthesis markup language). 
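
Direct specification of pitch contour and duration parameters as attribute values, as recited in claim 7, might look as follows when the markup is SSML. This is a minimal sketch: `to_ssml` and its input layout are hypothetical, the `xml:lang` value is an assumption, and only the `duration` and `contour` attributes of the SSML `prosody` element are used.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape, quoteattr

# (word text, duration in milliseconds, [(position in percent, pitch in Hz), ...])
Word = Tuple[str, int, List[Tuple[int, int]]]


def to_ssml(words: List[Word]) -> str:
    """Wrap each word in a <prosody> element carrying its measured
    duration and pitch contour as attribute values."""
    parts = []
    for text, duration_ms, contour in words:
        duration_attr = quoteattr(f"{duration_ms}ms")
        contour_attr = quoteattr(
            " ".join(f"({pos}%,{hz}Hz)" for pos, hz in contour)
        )
        parts.append(
            f"<prosody duration={duration_attr} contour={contour_attr}>"
            f"{escape(text)}</prosody>"
        )
    # The xml:lang value below is an assumed example language.
    head = (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    )
    return head + " ".join(parts) + "</speak>"
```

Each word then carries its own measured duration and contour, so a synthesizer that honors these attributes can reproduce the prosody of the spoken utterance.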

10. (Original) The program storage device of claim 1, further comprising
instruction for processing phonetic content of the spoken utterance to generate the 
synthetic waveform having a desired pronunciation. 

11. (Currently Amended) A method for speech synthesis, comprising the steps of:

providing a text string comprising a plurality of words and phonemes and 
corresponding spoken audio signal; 

extracting acoustic feature data from said audio signal, wherein extracting
acoustic feature data from said audio signal comprises digitizing the spoken audio signal 
into a set of frames and transforming the digitized input waveforms into a set of feature 
vectors on a frame-by-frame basis by producing a multi-dimensional cepstra feature
vector for a predetermined intervals of the spoken audio signal, concatenating frames to 
the left and to the right of a current frame to augment a current cepstral vector, and 
reducing the dimension of each augmented cepstral vector using linear discriminant 
analysis;

aligning the text string and the acoustic feature data and outputting a set of 
duration contours indicative of the duration of each word and phoneme; 

extracting pitch contour parameters from said audio spoken input; 

automatically generating a marked-up text corresponding to the spoken utterance 
using the pitch contour and duration parameters; and 

generating a synthetic waveform using the marked-up text. 

12-13. (Canceled) 

14. (Previously Presented) The method of claim 11, wherein aligning
comprises extracting acoustic feature data from the spoken utterance and time-aligning 
the spoken input to the corresponding text string using the acoustic feature data. 

15. (Previously Presented) The method of claim 11, wherein aligning is
performed using a Viterbi alignment process. 

16. (Canceled) 

17. (Previously Presented) The method of claim 11, wherein automatically
generating a marked-up text comprises directly specifying the pitch contour and duration 
parameters as attribute values for mark-up elements. 

18. (Previously Presented) The method of claim 11, wherein automatically
generating a marked-up text comprises assigning abstract labels to the pitch contour and 
duration parameters to generate a high-level markup. 

19. (Original) The method of claim 11, wherein the marked-up text is
generated using SSML (speech synthesis markup language). 

20. (Original) The method of claim 11, further comprising processing phonetic
content of the spoken utterance to generate the synthetic waveform having a desired 
pronunciation. 

21. (Currently Amended) A text-to-speech (TTS) system, comprising:

a prosody analyzer for determining prosodic parameters of a spoken utterance 
corresponding to an input text string and automatically generating a marked-up text 
corresponding to the spoken utterance using the prosodic parameters, wherein the prosody 
analyzer comprises: 

an acoustic feature extraction module that extracts acoustic feature data 
from said spoken utterance, including digitizing the spoken audio signal into a set of 
frames and transforming the digitized input waveforms into a set of feature vectors on a 
frame-by-frame basis by producing a multi-dimensional cepstra feature vector for a 
predetermined intervals of the spoken audio signal, concatenating frames to the left and to 
the right of a current frame to augment a current cepstral vector, and reducing the 
dimension of each augmented cepstral vector using linear discriminant analysis;

an alignment module for aligning the input text string with the spoken 
utterance using said acoustic feature data to generate duration contour information of 
elements comprising the input text string; 

a pitch contour extraction module for determining pitch contour 
information for the spoken utterance; and 

a conversion module for including markup in the input text string in
accordance with the duration and pitch contour information to generate the marked up 
text; and 

a TTS system for generating a synthetic waveform using the marked-up text. 

22. (Original) The system of claim 21, further comprising a user interface that
enables a user to input the spoken utterance and input a text string corresponding to the 
spoken utterance. 

23. (Original) The system of claim 21, wherein the prosody analyzer processes
phonetic content of the spoken utterance to generate the synthetic waveform having a 
desired pronunciation. 

24-28. (Canceled) 